Fixes #4388: Correct transcription_delay metric calculation in STT turn detec… #4396

devbyteai · 2025-12-26T15:41:06Z

Summary

This PR fixes the incorrect transcription_delay metric calculation when using STT-based turn detection (e.g., Deepgram Flux).

Problem

When using STT turn detection mode, the transcription_delay metric incorrectly shows ~0 seconds instead of reflecting the actual transcription latency.

User-Reported Behavior:

"EOU metrics showing ~0.79 transcription_delay when should reflect actual processing time"

The metric should measure the time between when the user stopped speaking and when the transcript was received, but it was always returning near-zero values.

Root Cause

In audio_recognition.py, the transcription_delay is calculated as:

transcription_delay = max(last_final_transcript_time - last_speaking_time, 0)

The bug was in the STT END_OF_SPEECH handler (line 452), which overwrote _last_speaking_time with time.time():

elif ev.type == stt.SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    ...
    self._last_speaking_time = time.time()  # BUG: Overwrites the value!

Event Timeline in STT Mode (Buggy):

START_OF_SPEECH → _last_speaking_time = time.time() (correct)
FINAL_TRANSCRIPT → _last_final_transcript_time = time.time() (correct)
END_OF_SPEECH → _last_speaking_time = time.time() (BUG - overwrites!)

Since END_OF_SPEECH typically arrives shortly after FINAL_TRANSCRIPT in STT mode, both timestamps become nearly identical, resulting in transcription_delay ≈ 0.

Solution

Remove the line that overwrites _last_speaking_time at END_OF_SPEECH in STT mode. The value was already correctly set at START_OF_SPEECH.

Comparison with VAD Mode:
VAD mode does NOT update _last_speaking_time at END_OF_SPEECH - it keeps the value from the last INFERENCE_DONE event. STT mode should follow the same pattern.

After Fix:

START_OF_SPEECH → _last_speaking_time = time.time() (preserved)
FINAL_TRANSCRIPT → _last_final_transcript_time = time.time()
END_OF_SPEECH → No overwrite

Result: transcription_delay = last_final_transcript_time - last_speaking_time now correctly represents the actual transcription latency.
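
To make the two timelines concrete, here is a minimal sketch that replays the events with hand-picked timestamps instead of time.time(). The replay helper, the string event names, and the timestamp values are illustrative assumptions, not code from audio_recognition.py; only the max(...) formula and the overwrite at END_OF_SPEECH come from the description above.

```python
from typing import List, Tuple


def transcription_delay(last_final_transcript_time: float, last_speaking_time: float) -> float:
    # Same formula as quoted in the Root Cause section.
    return max(last_final_transcript_time - last_speaking_time, 0)


def replay(events: List[Tuple[str, float]], overwrite_on_end_of_speech: bool) -> float:
    """Replay (event_type, timestamp) pairs and return the resulting delay.

    Hypothetical helper: it only mirrors the timeline described in this PR.
    """
    last_speaking_time = 0.0
    last_final_transcript_time = 0.0
    for event_type, timestamp in events:
        if event_type == "START_OF_SPEECH":
            last_speaking_time = timestamp
        elif event_type == "FINAL_TRANSCRIPT":
            last_final_transcript_time = timestamp
        elif event_type == "END_OF_SPEECH" and overwrite_on_end_of_speech:
            # The line removed by this PR: it replaces the earlier timestamp.
            last_speaking_time = timestamp
    return transcription_delay(last_final_transcript_time, last_speaking_time)


# Made-up timestamps in seconds; END_OF_SPEECH arrives just after FINAL_TRANSCRIPT.
EVENTS = [
    ("START_OF_SPEECH", 10.0),
    ("FINAL_TRANSCRIPT", 12.5),
    ("END_OF_SPEECH", 12.75),
]

buggy_delay = replay(EVENTS, overwrite_on_end_of_speech=True)
fixed_delay = replay(EVENTS, overwrite_on_end_of_speech=False)
print(buggy_delay, fixed_delay)
```

With the overwrite enabled, the subtraction goes negative and is clamped to 0; without it, the delay is measured from the preserved speaking timestamp.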

Testing

All 15 existing agent session tests pass:

tests/test_agent_session.py::test_events_and_metrics PASSED
tests/test_agent_session.py::test_tool_call PASSED
tests/test_agent_session.py::test_interruption[False-5.5] PASSED
tests/test_agent_session.py::test_interruption[True-5.5] PASSED
tests/test_agent_session.py::test_interruption_options PASSED
tests/test_agent_session.py::test_interruption_by_text_input PASSED
tests/test_agent_session.py::test_interruption_before_speaking[False-3.5] PASSED
tests/test_agent_session.py::test_interruption_before_speaking[True-3.5] PASSED
tests/test_agent_session.py::test_generate_reply PASSED
tests/test_agent_session.py::test_preemptive_generation[True-0.8] PASSED
tests/test_agent_session.py::test_preemptive_generation[False-1.1] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[False-0.0] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[False-2.0] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[True-0.0] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[True-2.0] PASSED

======================== 15 passed in 75.96s ========================

Backward Compatibility

No breaking changes - This fix only corrects the metric calculation. The actual agent behavior (speech recognition, turn detection, interruption handling) is completely unchanged.

Expected Impact:

Users with STT turn detection will now see accurate transcription_delay values in their metrics
Dashboards showing this metric will now report correct latency (previously under-reported as ~0)

Edge Cases Handled

No VAD present - Already handled at lines 376-382, falls back to STT timestamps
Multiple speech segments - START_OF_SPEECH updates _last_speaking_time for each new segment
Preflight transcripts - Also update _last_final_transcript_time correctly
VAD mode unchanged - Fix only affects STT turn detection mode

Files Changed

livekit-agents/livekit/agents/voice/audio_recognition.py

Removed the buggy self._last_speaking_time = time.time() line from END_OF_SPEECH handler
Added explanatory comment documenting why we don't update the timestamp here

Related Issues

#4325: min_endpointing_delay behaves differently in VAD vs STT turn detection mode (related timing inconsistency)

Summary by CodeRabbit

Bug Fixes
- Improved handling of initial speech timestamps, resulting in more accurate latency measurements and more reliable end-of-turn detection during voice recognition.
- Reduces premature or inconsistent end-of-utterance decisions, especially when voice activity detection is active.
- No other user-visible behavior changes beyond improved timing and reliability.

davidzhao · 2026-01-15T19:07:36Z

@codex review

chatgpt-codex-connector · 2026-01-15T19:11:51Z

Codex Review: Didn't find any major issues. Chef's kiss.

davidzhao

@devbyteai please remove PR_DESCRIPTION.md from the commit

livekit-agents/livekit/agents/voice/audio_recognition.py

coderabbitai · 2026-01-21T09:09:40Z

Walkthrough

Replaced sentinel checks for _last_speaking_time so END_OF_SPEECH and STT event paths compare against None and only set _last_speaking_time when VAD is disabled or it was previously None, preventing unintended overwrites of prior speaking timestamps.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| STT turn-detection timing `livekit-agents/livekit/agents/voice/audio_recognition.py` | Use `None` as the uninitialized sentinel for `_last_speaking_time`; change conditions in FINAL_TRANSCRIPT, PREFLIGHT_TRANSCRIPT, and END_OF_SPEECH paths to avoid overwriting an existing speaking timestamp when VAD is used. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I measured the hop, I kept the first chime,
No overwrite at ending, I saved the first time.
VAD gave its word, STT kept the tune,
Now latency sings true beneath the moon. 🕰️✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title correctly identifies the fix for issue `#4388` regarding transcription_delay metric calculation in STT turn detection mode. |
| Linked Issues check | ✅ Passed | The pull request addresses the primary objective from `#4388` by replacing the overwriting of _last_speaking_time at END_OF_SPEECH with None checks, ensuring transcription_delay accurately reflects the interval between speaking and transcription. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope; the PR only modifies sentinel value checks from 0 to None in STT event handling to correct transcription_delay calculation as required by `#4388`. |

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0e2ffb6 and 0cee66e.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (1)

livekit-agents/livekit/agents/voice/audio_recognition.py

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

livekit-agents/livekit/agents/voice/audio_recognition.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: unit-tests

🔇 Additional comments (3)

livekit-agents/livekit/agents/voice/audio_recognition.py (3)

376-382: Correct sentinel check for preserving VAD timing.

The change from == 0 to is None properly matches the type annotation (float | None) and initialization. The conditional logic now correctly:

Preserves VAD-provided _last_speaking_time when VAD is enabled

Falls back to STT timing when VAD is disabled or hasn't fired yet

428-430: Consistent sentinel check in PREFLIGHT_TRANSCRIPT handler.

Matches the pattern used at line 376, ensuring VAD timing is preserved here as well.

446-456: Core fix: END_OF_SPEECH no longer unconditionally overwrites the speaking timestamp.

This directly addresses issue #4388. Previously, END_OF_SPEECH (which can arrive immediately after FINAL_TRANSCRIPT in STT mode) was overwriting _last_speaking_time, causing transcription_delay ≈ 0.

With this fix:

When VAD is present: the timestamp from VAD's INFERENCE_DONE is preserved

When VAD is absent: STT timing is used as the fallback (with acknowledged limitations per the TODO at line 378)

This aligns STT mode behavior with VAD mode, which does not overwrite _last_speaking_time at END_OF_SPEECH.
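
The guarded update described in this comment can be sketched as a pure function. This is an illustration under assumptions: the function name and parameters are hypothetical, and the real handler mutates self._last_speaking_time and reads time.time() rather than taking arguments.

```python
from typing import Optional


def speaking_time_after_end_of_speech(
    last_speaking_time: Optional[float],
    vad_enabled: bool,
    now: float,
) -> float:
    """Return the speaking timestamp to keep once END_OF_SPEECH arrives.

    Hypothetical stand-in for the guarded assignment in the handler.
    """
    if not vad_enabled or last_speaking_time is None:
        # No VAD, or nothing recorded yet: fall back to the STT event time.
        return now
    # VAD already supplied a timestamp: preserve it.
    return last_speaking_time


# Made-up timestamps in seconds.
kept = speaking_time_after_end_of_speech(10.0, vad_enabled=True, now=12.75)
fallback_unset = speaking_time_after_end_of_speech(None, vad_enabled=True, now=12.75)
fallback_no_vad = speaking_time_after_end_of_speech(10.0, vad_enabled=False, now=12.75)
print(kept, fallback_unset, fallback_no_vad)
```

Note that in the no-VAD branch the timestamp is still replaced by the STT event time, which is the limitation the comment attributes to the TODO.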

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@livekit-agents/livekit/agents/voice/audio_recognition.py`:
- Around line 452-453: The condition uses `self._last_speaking_time == 0` but
`_last_speaking_time` is initialized to None and reset to None, so replace
comparisons to 0 with explicit None checks; update the three spots in
audio_recognition.py where you see `if not self._vad or self._last_speaking_time
== 0` (and similar at the other two locations) to `if not self._vad or
self._last_speaking_time is None` so START_OF_SPEECH only sets the timestamp
when it truly hasn't been set, and END_OF_SPEECH/other branches behave
correctly; ensure you update all occurrences that reference
`_last_speaking_time` in the relevant methods to use `is None`.
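
A small sketch of why the sentinel matters, assuming only what the comment states (the attribute starts as None and is reset to None). The two predicate functions are hypothetical names for the old and new conditions.

```python
from typing import Optional


def should_set_with_zero_check(vad_enabled: bool, last_speaking_time: Optional[float]) -> bool:
    # Old condition: compares against 0, which an unset (None) value never equals.
    return (not vad_enabled) or last_speaking_time == 0


def should_set_with_none_check(vad_enabled: bool, last_speaking_time: Optional[float]) -> bool:
    # New condition: matches the actual "unset" sentinel.
    return (not vad_enabled) or last_speaking_time is None


# VAD enabled and no timestamp recorded yet.
old_result = should_set_with_zero_check(True, None)
new_result = should_set_with_none_check(True, None)
print(old_result, new_result)
```

Because None == 0 evaluates to False in Python, the old check never treats an unset timestamp as unset while VAD is enabled; the is None check does.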

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 1f4fcef and 0e2ffb6.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (1)

livekit-agents/livekit/agents/voice/audio_recognition.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: unit-tests
GitHub Check: type-check (3.13)
GitHub Check: type-check (3.9)

livekit-agents/livekit/agents/voice/audio_recognition.py

…tion mode Fixes livekit#4388 Remove the line that overwrites _last_speaking_time at END_OF_SPEECH in STT mode. This was causing transcription_delay to always be ~0 since END_OF_SPEECH typically arrives after FINAL_TRANSCRIPT, making both timestamps nearly identical.

davidzhao

lg!

davidzhao requested a review from longcw January 15, 2026 19:26

davidzhao reviewed Jan 15, 2026

longcw reviewed Jan 16, 2026

livekit-agents/livekit/agents/voice/audio_recognition.py (outdated, resolved)

devbyteai force-pushed the fix/transcription-delay-stt-mode branch from ee893ae to 0e2ffb6 Compare January 21, 2026 09:09

coderabbitai bot reviewed Jan 21, 2026

livekit-agents/livekit/agents/voice/audio_recognition.py (outdated, resolved)

devbyteai force-pushed the fix/transcription-delay-stt-mode branch from 0e2ffb6 to 0cee66e Compare January 21, 2026 09:21

davidzhao approved these changes Jan 21, 2026

davidzhao merged commit 73eb202 into livekit:main Jan 21, 2026
10 checks passed

coderabbitai bot mentioned this pull request Jan 27, 2026

prevent duplicate false interruption due to late end of speech #4621

Merged
